8 research outputs found

    Approximating a similarity matrix by a latent class model: A reappraisal of additive fuzzy clustering

    Let Q be a given n×n square symmetric matrix of similarities, with nonnegative elements between 0 and 1. Fuzzy clustering results in fuzzy assignment of individuals to K clusters. In additive fuzzy clustering, the n×K fuzzy membership matrix P is found by least-squares approximation of the off-diagonal elements of Q by inner products of rows of P. By contrast, kernelized fuzzy c-means is not least-squares and requires an additional fuzziness parameter. The aim is to popularize additive fuzzy clustering by interpreting it as a latent class model, whereby the elements of Q are modeled as the probability that two individuals share the same class on the basis of the assignment probability matrix P. Two new algorithms are provided: a brute-force genetic algorithm (differential evolution) and an iterative row-wise quadratic programming algorithm, of which the latter is the more effective. Simulations showed that (1) the method usually has a unique solution, except in special cases, (2) both algorithms reached this solution from random restarts, and (3) the number of clusters can be well estimated by AIC. Additive fuzzy clustering is computationally efficient and combines attractive features of both the vector model and the cluster model.
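    A minimal sketch of the least-squares criterion described in the abstract above, assuming each row of P is a probability vector (nonnegative, summing to 1), as the latent class reading suggests. It uses a plain row-wise projected-gradient update as a simple stand-in; it is not the paper's quadratic programming or differential evolution algorithm. The function names, the step size, and the toy similarity matrix are illustrative assumptions.

```python
import random


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold of the last k that passes
    return [max(x - theta, 0.0) for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss(Q, P):
    """Sum of squared residuals over the off-diagonal pairs i < j."""
    n = len(Q)
    return sum((Q[i][j] - dot(P[i], P[j])) ** 2
               for i in range(n) for j in range(i + 1, n))


def fit_additive_fuzzy(Q, K, sweeps=200, seed=0):
    """Row-wise projected gradient descent (an assumed stand-in solver)."""
    rng = random.Random(seed)
    n = len(Q)
    P = [project_to_simplex([rng.random() for _ in range(K)])
         for _ in range(n)]
    # With the other rows fixed, the loss is a convex quadratic in row i
    # whose curvature is at most 2 * (n - 1), so this step cannot increase it.
    step = 1.0 / (2.0 * (n - 1))
    for _ in range(sweeps):
        for i in range(n):
            grad = [0.0] * K
            for j in range(n):
                if j == i:
                    continue  # the diagonal of Q is not fitted
                resid = Q[i][j] - dot(P[i], P[j])
                for k in range(K):
                    grad[k] -= 2.0 * resid * P[j][k]
            P[i] = project_to_simplex(
                [p - step * g for p, g in zip(P[i], grad)])
    return P


# Toy similarity matrix (hypothetical): two crisp groups of two individuals.
Q_toy = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
P_toy = fit_additive_fuzzy(Q_toy, K=2)
```

    Each row of `P_toy` can be read as the class-assignment probabilities of one individual, and `loss(Q_toy, P_toy)` reports how well the inner products reproduce the off-diagonal similarities. Like the algorithms in the abstract, a local method of this kind would normally be run from several random restarts (different `seed` values).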

    The use of multiple hierarchically independent gene ontology terms in gene function prediction and genome annotation

    The Gene Ontology (GO) is a widely used controlled vocabulary for the description of gene function. In this study we quantify the usage of multiple and hierarchically independent GO terms in the curated genome annotations of seven well-studied species. In most genomes, significant proportions (6-60%) of genes have been annotated with multiple and hierarchically independent terms. This may be necessary to attain adequate specificity of description. One noticeable exception is Arabidopsis thaliana, in which genes are much less frequently annotated with multiple terms (6-14%). In contrast, an analysis of the occurrence of InterPro hits in the proteomes of the seven species, followed by a mapping of the hits to GO terms, did not reveal an aberrant pattern for the A. thaliana genome. This study shows the widespread usage of multiple hierarchically independent GO terms in the functional annotation of genes. Consequently, probabilistic methods that aim to predict gene function automatically through integration of diverse genomic datasets, and that employ the GO, must be able to predict such multiple terms. We attribute the low frequency with which multiple GO terms are used in Arabidopsis to differences in genome annotation and curation practices between communities of annotators. This may bias genome-scale comparisons of gene function between different species. GO term assignment should therefore be performed according to strictly similar rules and standards.

    Bayesian Markov random field analysis for integrated network-based protein function prediction

    Unravelling the functions of proteins is one of the most important aims of modern biology. Experimental inference of protein function is expensive and not scalable to large datasets. In this thesis a probabilistic method for protein function prediction is presented that integrates different types of data such as sequences and networks. The method is based on Bayesian Markov Random Field (BMRF) analysis. BMRF was initially applied to genome-wide protein function prediction using network data in yeast, and also in Arabidopsis by integrating protein domains (i.e., InterPro signatures), expression data and protein-protein interactions. Several of the predictions were confirmed by experimental evidence. Further, an evolutionary discrete optimization algorithm is presented that integrates function predictions for different Gene Ontology (GO) terms into a single prediction that is consistent with the True Path Rule as imposed by the GO Directed Acyclic Graph. This integration leads to predictions that are easy to interpret. Evaluation of this algorithm using Arabidopsis data showed that prediction performance is improved compared to single GO term predictions.
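    To illustrate the consistency requirement mentioned in the abstract above: under the True Path Rule, a protein annotated with a GO term is implicitly annotated with every ancestor of that term in the directed acyclic graph. The sketch below only checks and repairs a set of predicted terms against a toy graph. It is not the evolutionary optimization algorithm from the thesis, and the term labels and helper names are hypothetical.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def true_path_violations(predicted, parents):
    """Predicted terms that are missing at least one ancestor."""
    return sorted(t for t in predicted
                  if not ancestors(t, parents) <= predicted)


def close_under_true_path_rule(predicted, parents):
    """Smallest superset of `predicted` that satisfies the rule."""
    closed = set(predicted)
    for term in predicted:
        closed |= ancestors(term, parents)
    return closed


# Toy DAG (hypothetical labels): C has two parents, A and B.
toy_parents = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"]}
toy_prediction = {"A", "C"}

violations = true_path_violations(toy_prediction, toy_parents)
repaired = close_under_true_path_rule(toy_prediction, toy_parents)
```

    Here `toy_prediction` is inconsistent because neither `B` nor `root` is predicted, and `repaired` adds both. Adding ancestors is only one way to restore consistency; dropping unsupported descendant terms is another, and the choice between them is the kind of trade-off an optimization approach would need to weigh.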

    A comparative fluctuating asymmetry study between two walnut (Juglans regia L.) populations may contribute as an early signal for bio-monitoring

    Developmental stability, the ability of an individual to buffer environmental disturbances while expressing a heritable phenotypic trait, was compared in two walnut (Juglans regia L.) populations, one natural and one artificial. Bilateral leaf morphometrics were used to estimate fluctuating asymmetry, which refers to random deviation from perfect symmetry of bilateral traits resulting from extrinsic and intrinsic perturbations not buffered during development. Fluctuating asymmetry was used as a proxy for developmental stability. We analyzed our data from a Bayesian perspective, showing that developmental stability levels are decreased in the natural population. Our results indicate that attention may need to be directed towards the conservation of the natural walnut resources of the area. Fluctuating asymmetry as an indicator of developmental stability may contribute, especially in the framework of comparative studies, as a population biomonitoring tool.

    The turbulent life of Sirevirus retrotransposons and the evolution of the maize genome: more than ten thousand elements tell the story

    Sireviruses are one of the three genera of Copia long terminal repeat (LTR) retrotransposons, exclusive to and highly abundant in plants, and with a genome structure that is unique among retrotransposons. Yet, perhaps due to the few references to the Sirevirus origin of some families, compounded by the difficulty in correctly assigning retrotransposon families into genera, Sireviruses have hardly featured in recent research. As a result, analysis at this key level of classification and details of their colonization and impact on plant genomes are currently lacking. Recently, however, it became possible to accurately assign elements from diverse families to this genus in one step, based on highly conserved sequence motifs. Hence, Sirevirus dynamics in the relatively obese maize genome can now be comprehensively studied. Overall, we identified >10 600 intact and approximately 28 000 degenerate Sirevirus elements from a plethora of families, some brought into the genus for the first time. Sireviruses make up approximately 90% of the Copia population and are the only genus that has successfully infiltrated the genome, possibly by experiencing intense amplification during the last 600 000 years, while being constantly recycled by host mechanisms. They accumulate in chromosome-distal gene-rich areas, where they insert in between gene islands, mainly in preferred zones within their own genomes. Sirevirus LTRs are heavily methylated, while there is evidence for a palindromic consensus target sequence. This work brings Sireviruses into the spotlight, elucidating their lifestyle and history, and suggesting their crucial role in the current genomic make-up of maize, and possibly other plant hosts.

    Genome-wide computational function prediction of Arabidopsis thaliana proteins by integration of multiple data sources

    Although Arabidopsis thaliana is the best studied plant species, the biological role of one third of its proteins is still unknown. We developed a probabilistic protein function prediction method that integrates information from sequences, protein-protein interactions and gene expression. The method was applied to proteins from Arabidopsis thaliana. Evaluation of prediction performance showed that our method outperforms single source-based prediction approaches and two existing integration approaches. An innovative feature of our method is that it enables transfer of functional information between proteins that are not directly associated with each other. We provide novel function predictions for 5,807 proteins. Recent experimental studies confirmed several of the predictions. We highlight these in detail for proteins predicted to be involved in flowering and floral organ development.
